Text Mining Kickstarter Projects

Overview

Kickstarter is an American public-benefit corporation based in Brooklyn, New York, that maintains a global crowd funding platform focused on creativity. The company’s stated mission is to “help bring creative projects to life”.

Kickstarter has reportedly received more than $1.9 billion in pledges from 9.4 million backers to fund 257,000 creative projects, such as films, music, stage shows, comics, journalism, video games, technology and food-related projects.

For this assignment, I analyze the descriptions of kickstarter projects to identify commonalities of successful (and unsuccessful projects) using text mining techniques.

Data

The dataset for this assignment is taken from webroboto.io ‘s repository. They developed a scrapper robot that crawls all Kickstarter projects monthly since 2009. We will just take data from the most recent crawl on 2018-02-15.

The data is contained in the file kickstarter_data.csv and contains about 150,000 projects and about 20 variables.

ks <- read_csv("../data/kickstarter_projects.csv")

1. Identifying Successful Projects

a) Success by Category

There are several ways to identify success of a project: - State (state): Whether a campaign was successful or not. - Pledged Amount (pledged) - Achievement Ratio: Create a variable achievement_ratio by calculating the percentage of the original monetary goal divided by the actual amount pledged.
- Number of backers (backers_count) - How quickly the goal was reached (difference between launched_at and state_changed_at) for those campaigns that were successful.

Use one or more of these measures to visually summarize which categories were most successful in attracting funding on kickstarter. Briefly summarize your findings.

To defined “success,” I chose to create a ratio of pledged amount of money over the monetary goal of the campaign. I removed any campaign with a goal of under $30 to ensure results were a little less skewed and eliminated any campaigns deemed unsuccessful since many of these campaigns dropped out or discontinued. I then broke out the results into categories of campaigns; we can see below that all campaign categories received an average of over 100% of their goal money, which is surprising! The technology category seems to have the most success while art campaigns were still highly successful (over 100% of goal money on average) but less exhorbitantly successful than other campaign categories.

ks$goal <- round(ks$goal, 0)
ks <- ks[!duplicated(ks[, c('id')]),]
ks <- ks %>% 
  mutate(achievement = (converted_pledged_amount/goal)* 100)
ks$achievement <- round(ks$achievement, 2)

# by achievement score (proportion of pledged amount over goal)
top20 <- ks %>% 
  arrange(desc(achievement)) %>% 
  filter(goal >= 30) %>% 
  slice(1:20)

# by number of backers
top10 <- ks %>% 
  arrange(desc(backers_count)) %>% 
  slice(1:10)


# by category
cat_success <- na.omit(ks) %>% 
  filter(goal >= 100 & state == "successful") %>% 
  group_by(top_category) %>% 
  summarize(avg = mean(achievement)) %>% 
  arrange(desc(avg)) 
  
ggplot(data = cat_success, aes(x = reorder(top_category, avg), y = avg)) +
geom_bar(stat = "identity", fill = "#a8ddb5", alpha = 0.8) + 
  coord_flip() + theme_tufte() + 
  geom_text(aes(label = top_category, x=top_category, y = 1), hjust = 0,
            size = 4, family = 'serif', color = "black") +
  theme(axis.title = element_blank(), 
        axis.text.y =  element_blank(), 
        axis.ticks = element_blank()) +
  scale_y_continuous(labels=function(x) paste0(x, "%")) +
  labs(x = '', 
       title = "Most Successful KickStarter Campaigns: Category", 
       subtitle = "Highest proportions of backer money to goal on average")

b) Success in San Francisco

I was curious to visualize success in San Francisco because it’s a tech hub and a hot spot for Kickstarter campaigns. Most campaigns from San Francisco recieved over 100% of their goal from their backers, so “success” took on a new definition of popularity. Larger circles had more backers and circles placed higher up on the graph received the most pledged money. Scroll over each bubble for more information about the project.

# filtered for goal over 100 this time
ks_sf <- ks %>% 
  filter(goal > 100 & backers_count > 10 & state == "successful" & location_town == "San Francisco") 
data(ks_sf)

# high charter tooltip 
x <- c("Name", "Backers:  ", "Pledged $: ", "Goal $: ", "Success Ratio %: ")
y <- sprintf("{point.%s}", c("name", "backers_count", "converted_pledged_amount", "goal", "achievement"))
tltip <- tooltip_table(x, y)

colors <- c("#0868ac", "#084081", "#2b8cbe", "#7bccc4","#a8ddb5")

ks_sf$color <- colorize((log(ks_sf$achievement)), colors)

hchart(ks_sf, "scatter", hcaes(goal, pledged, size = backers_count, color = color)) %>% 
  hc_chart(backgroundColor = "white") %>% 
  hc_title(text = "Kickstarter Projects") %>% 
  hc_subtitle(text = "San Francisco, 2009 - 2018") %>% 
  hc_tooltip(useHTML = TRUE, 
             headerFormat = '', 
             pointFormat = tltip) %>% 
  hc_size(height = 500) %>% 
  # hc_legend(layout = "vertical", verticalAlign = "top",
  #           align = "right", valueDecimals = 0) %>% 
  hc_size(height = 500, width = 600) %>% 
  hc_yAxis(
    title = list(text = "Pledged Amount $", gridLineWidth = 0.5)) %>% 
  hc_xAxis(
    title = list(text = "Goal Amount $", gridLineWidth = 0.5)) %>% 
  hc_add_theme(hc_theme_tufte())

2. Writing your success story

Each project contains a blurb – a short description of the project. While not the full description of the project, the short headline is arguably important for inducing interest in the project (and ultimately popularity and success). Let’s analyze the text.

a) Cleaning the Text and Word Cloud

I selected the top 1000 most successful projects and the bottom 1000 least successful projects, then removed stop words, punctuation and numbers. I chose to not stem the words as to leave more of the meaning in the word cloud. I then created a document-term-matrix and term-document-matrix and visualized the most frequent words among the 1000 most successful projects in a word cloud.

Most common words throughout all 2000 projects include “game,” “art,” “designed” and “homemade.” Many of the kickstarter projects must be game-related, and it’s not surprising that many projects are homemade and artistically designed!

# choose top 1000 and bottom 1000 projects
top1000 <- ks %>% 
  filter(goal > 100 & state == "successful") %>% 
  arrange(desc(achievement)) %>% 
  mutate(success = "most successful") %>% 
  slice(1:1000)

bottom1000 <- ks %>% 
  filter(goal > 100) %>% 
  arrange(-desc(achievement)) %>% 
  mutate(success = "least successful") %>% 
  slice(1:1000)

projects <- rbind(top1000, bottom1000)
projects <- projects %>% 
  select(id, blurb)
projects <- as.data.frame(projects)

# lowercase, rid of punctuation and stopwords, chose not to stem words.
blurb_words <- projects %>%
  unnest_tokens(word, blurb, to_lower = TRUE) %>% 
  #mutate(word = wordStem(word)) %>% 
  group_by(id) %>% 
  count(word, sort = FALSE)
blurb_words <- blurb_words %>% 
  anti_join(stop_words) 
blurb_words$word <- removeNumbers(blurb_words$word)
blurb_words$word <- removePunctuation(blurb_words$word)
blurb_words$word <- qdap::Filter(blurb_words$word, min = 2, max = Inf, count.apostrophe = TRUE,
  stopwords = NULL, ignore.case = TRUE) # get rid of words less than 2 letters long

# create document term matrix
dtm <- blurb_words %>%
  tidytext::cast_dtm(id, word, n)
dtm_mat <- as.matrix(dtm)

# create term document matrix
tdm <- blurb_words %>% 
  cast_tdm(word, id, n)
tdm_mat <- as.matrix(tdm)

# all most successful docs together
blurb_words_succ <- blurb_words %>%
  filter(id %in% top1000$id) %>% 
  group_by(word) %>% 
  count(word, sort = FALSE) %>% 
  filter(nn < 300) %>% 
  ungroup()
  

set.seed(2018)
wordcloud(words = blurb_words_succ$word, freq = blurb_words_succ$nn, scale = c(5, .5), min.freq = 5,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors=brewer.pal(8, "Dark2"))

b) Success in words

I then created a pyramid plot to show how the words between the most successful and unsuccessful projects differ in frequency. I selected the top 20 words used by the 1000 most successful and least successful campaigns.

# df for unsuccessful projects
blurb_words_unsucc <- blurb_words %>%
  filter(id %in% bottom1000$id) %>% 
  group_by(word) %>% 
  count(word, sort = TRUE) %>% 
  filter(nn != 236) %>% 
  ungroup()

# differentiate between successful and unsuccessful
blurb_words_unsucc$success <- "Unsuccessful"
blurb_words_succ$success <- "Successful"

# slice into top 20
blurb_words_succ20 <- blurb_words_succ %>% 
dplyr::arrange(desc(nn)) %>% 
  dplyr::slice(1:20)
blurb_words_unsucc20 <- blurb_words_unsucc %>% 
arrange(desc(nn)) %>% 
  slice(1:20)

# combine data frames
blurb_wrds_20 <- rbind(blurb_words_unsucc20, blurb_words_succ20)
blurb_wrds_20$nn <- ifelse(blurb_wrds_20$success == "Unsuccessful", blurb_wrds_20$nn * -1, blurb_wrds_20$nn) # turn unsuccessful numbers negative

blurb_wrds_20$row <- row_number(blurb_wrds_20$word)

# pyramid plot
ggplot(blurb_wrds_20, aes(x = reorder(word, nn), y = nn, fill = success)) +
  geom_bar(data = dplyr::filter(blurb_wrds_20, success == "Successful"), stat = "identity") +  
  geom_bar(data = dplyr::filter(blurb_wrds_20, success == "Unsuccessful"), stat = "identity") + 
  coord_flip() +
  scale_fill_manual(values=c("#a8ddb5", "#0868ac"), name = '') +
  labs(y = '', x = '') +
  theme_tufte() +
  theme(axis.ticks = element_blank()) +
  theme(legend.position = "top")

c) Simplicity as a virtue

These blurbs are short in length (max. 150 characters) but let’s see whether brevity and simplicity still matters. I calculated a readability measure (Flesh Kincaid) for the texts defined by the equation below (from Wikipedia):

Flesch Kincaid produces a number that defines how easy or difficult a chunk of text is based on what grade level in school should be able to read it. Anything that receives over a 90 is considered easy to read (a 5th grader could read it) and anything below a 30 is considered difficult and for college students to be able to read and understand. I then visualized the relationship between the Flesch Kincaid measure of the short blurb of the campaign and the number of backers each campaign had.

From the visualization, we can see that most heavily-backed projects were in between -13 to 80 in terms of Flesch Kincaid scores. That’s a very large range! Most projects in general were left-skewed, but a few blurbs were so difficult to read they received negative scores. Those projects didn’t have the most backers but still received over a thousand.

Hover over the circles for more information about each project.The larger the circle, the more backers the project received.

#blurb_words_2000 <- rbind(blurb_words_succ, blurb_words_unsucc)
# calculate syllables and sentences 
fk <- blurb_words
fk$syllable <- syllable_sum(fk$word)
fk$syllable[is.na(fk$syllable)] <- 0
projects$sentences <- nsentence(projects$blurb)
fk <- fk %>% 
  group_by(id) %>% 
  mutate(total_words = length(word), total_syllables = sum(syllable)) %>% 
  ungroup()

fk <- merge(fk, projects, by = "id")

# calculating flesch reading ease scores:
fk$fk_score <- (206.835 - (1.015*(fk$total_words/fk$sentences)) - (84.6*(fk$total_syllables/fk$total_words)))

# summarizing by ids
na.omit(fk <- fk %>% 
  group_by(id) %>% 
  select(id, fk_score) %>% 
  unique())
## # A tibble: 1,994 x 2
## # Groups:   id [1,994]
##         id fk_score
##      <int>    <dbl>
##  1  637867    19.0 
##  2 1814172    50.7 
##  3 2223811     4.92
##  4 2240943    14.3 
##  5 4152204   119   
##  6 5665319    57.9 
##  7 5916690    32.1 
##  8 6009087    91.8 
##  9 6023841    11.1 
## 10 6081847    49.5 
## # ... with 1,984 more rows
# joining previous success info 
fk <- merge(fk, ks, by = "id")
fk <- fk %>% 
  filter(goal > 100 & backers_count > 10 & achievement < 3600)

x <- c("Name", "Backers:  ", "Pledged $: ", "Goal $: ", "Success Ratio %: ", "Flesch-Kinkaid Readability Score:")
y <- sprintf("{point.%s}", c("name", "backers_count", "converted_pledged_amount", "goal", "achievement", "fk_score"))
tltip <- tooltip_table(x, y)

colors <- c("#0868ac", "#084081", "#2b8cbe", "#7bccc4","#a8ddb5")

fk$color <- colorize((log(fk$backers_count)), colors)

hchart(fk, "scatter", hcaes(fk_score, backers_count, size = backers_count, color = color)) %>% 
  hc_chart(backgroundColor = "white") %>% 
  hc_title(text = "Kickstarter Projects 2009 - 2018") %>% 
  hc_subtitle(text = "Popularity & Readability Score of Summary Blurb") %>% 
  hc_tooltip(useHTML = TRUE, 
             headerFormat = '', 
             pointFormat = tltip) %>% 
  hc_size(height = 500) %>% 
  hc_size(height = 500, width = 600) %>% 
  hc_yAxis(
    title = list(text = "# of Backers", gridLineWidth = 0.5)) %>% 
  hc_xAxis(
    title = list(text = "Flesch-Kinkaid Score (difficult to easy)", gridLineWidth = 0.5)) %>% 
  hc_add_theme(hc_theme_tufte())

3. Sentiment

Now, let’s check whether the use of positive / negative words or specific emotions helps a project to be successful.

a) Stay positive

I calculate the tone of each text using the AFINN lexicon based on the positive and negative words used in each blurb. AFINN assigns a positive or negative score to each word (between -5 and 5) depending on how positive or negative the sentiment attached to the word is. I calculated a cumulative measure of negativity or positivity per blurb by adding up the scores on each individual word. Below, I visualized the relationship between the tone of the blurb and success.

While quite similar in nature in terms of the number of and magnitude of positive and negative words used in each blurb, less successful projects used a larger range of extreme positive or extreme negative words. Both successful and unsuccessful projects mainly use positive words (with a score above 0 averaging around a score of 3) overall.

# join lexicon
afinn <- get_sentiments("afinn")  
word_sent <- merge(x = blurb_words, y = ks, by = "id", all.x = TRUE)
word_sent <- word_sent %>% 
  select(id, name, word, n, created_at, pledged, backers_count, goal, achievement, blurb) %>% 
  inner_join(afinn)
word_sent$score <- (word_sent$score * word_sent$n)

# assign labels to most and least successful
word_sent <- word_sent %>% 
  dplyr::filter(id %in% top1000$id| id %in% bottom1000$id)
word_sent$success <- ifelse(word_sent$id %in% top1000$id, "Most Successful", "Least Successful")


# x = 0
# for (i in seq(1:985)) {
#   blurb_sent_succ$cumm_score[i] <- blurb_sent_succ$score[i] + x
#   x =  blurb_sent_succ$cumm_score[i]
# }

# assign scores to blurbs as a whole based on word scores
blurb_sent <- word_sent %>% 
  group_by(id) %>% 
  mutate(txt_score = sum(score)) %>% 
  select(id, txt_score, backers_count, success, blurb) %>% 
  unique() %>% 
  ungroup()

# plot
ggplot(data = blurb_sent, aes(x = success, y = txt_score)) +
  geom_violin(aes(fill = success, alpha = 0.8)) +
  coord_flip() +
  theme_tufte() +
  scale_fill_manual(values=c("#0868ac","#a8ddb5")) +
  labs(x = '', y = '\n Negative to Positive Scoring Words', 
       title = 'Positive and Negative Words in Campaign Blurbs', 
       subtitle = 'Unsuccessful campaigns have larger range of extreme positive and extreme negative sentiments') +
  theme(axis.ticks = element_blank(),
        legend.position = "none")

b) Positive vs negative

I segregated all 2,000 blurbs into positive and negative texts based on their polarity score calculated in step (a). Then, I created a term-document-matrix in order to generate a comparison cloud showing the most-frequent positive and negative words.

Negative words most frequently include the use of “limited,” which is funny only because often times a “limited edition” is seen as a good thing! Other frequent negative terms include “hard,” “battle,” “lost” and “fight.”

Frequent positive words include “inspired,” “love,” “easy” and “share.”

# neg and pos grouping 
word_sent$negpos <- ifelse(word_sent$score < 0, "negative", "positive")

# create tdm 
tdm_pn <- word_sent %>% 
  cast_tdm(word, negpos, n)
tdm_pn_mat <- as.matrix(tdm_pn)

# comparison cloud
comparison.cloud(tdm_pn_mat, colors = c("#a8ddb5", "#0868ac"), scale = c(5, 0.5), max.words = 500, title.size = 2, rot.per = 0.35)

c) Get in their mind

Lastly, I used the NRC Word-Emotion Association Lexicon to identify a larger set of emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust). I visualized the relationship between success and range of emotions in two different ways; first with a bar chart and second with a lollipop chart. We can see in both that successful compaigns tend to use more of every type of emotional word except for “joy,” for which unsuccessful campaigns use more.

# join nrc sentiment
nrc <- get_sentiments("nrc")  
word_sent <- word_sent %>% 
  inner_join(nrc) %>% 
  select(id, name, word, n, achievement, success, sentiment) %>% 
  group_by(success, sentiment) %>% 
  summarize(n = sum(n)) %>% 
  ungroup()

# plot
ggplot(data = word_sent, aes(x = reorder(sentiment, -n), y = n, fill = success)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_tufte() +
  scale_fill_manual(values=c("#0868ac","#a8ddb5"), name = '') +
  labs(x = '', y = '\n # of words', 
       title = 'Sentiment of Campaign Blurbs', 
       subtitle = 'Unsuccessful campaigns express joy ') +
  theme(axis.ticks = element_blank(),
        legend.position = "bottom")

The lollipop chart makes it easy for us to view the true difference in number of words per emotion for successful and unsuccessful campaigns. There’s the largest difference in use of negative words; successful campaigns use plenty more of these than unsuccessful campaigns– maybe the use of negative words adds to the drama of the project and leads to more backers?

# factor for ordering levels
word_sent$sentiment <- factor(word_sent$sentiment, levels = c("positive", "joy", "trust", "anticipation", "negative", "anger", "surprise", "sadness", "disgust", "fear"))

# plot
ggplot(data = word_sent, aes(x = n, y = sentiment, color = success)) +
  geom_segment(aes(x = 0, y = sentiment, xend = n, yend = sentiment), color = "lightgrey") +
  geom_point(size = 2) +
  scale_colour_manual(values=c('#8c510a','#01665e'), name = '') +
  theme_tufte() +
  labs(x = "Number of Words", y = "", title = 'Sentiment of Campaign Blurbs', 
       subtitle = 'Unsuccessful campaigns express joy more than successful campaigns')